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Abstract. Multi-task learning leverages shared information among data 
sets to improve the learning performance of individual tasks. The paper 
applies this framework for data where each task is a phase-shifted peri- 
odic time series. In particular, we develop a novel Bayesian nonparamet- 
ric model capturing a mixture of Gaussian processes where each task is 

^ j , a sum of a group-specific function and a component capturing individual 

J ■ variation, in addition to each task being phase shifted. We develop an 

efficient em algorithm to learn the parameters of the model. As a spe- 
cial case we obtain the Gaussian mixture model and em algorithm for 
phased-shifted periodic time series. Furthermore, we extend the proposed 
model by using a Dirichlet Process prior and thereby leading to an infi- 
nite mixture model that is capable of doing automatic model selection. A 
Variational Bayesian approach is developed for inference in this model. 
Experiments in regression, classification and class discovery demonstrate 
the performance of the proposed models using both synthetic data and 

JSJ ' real-world time series data from astrophysics. Our methods are partic- 

ularly useful when the time series are sparsely and non-synchronously 
sampled. 
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1 Introduction 
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j^ , In many real world problems we are interested in learning multiple tasks while 

J-j ■ the training set for each task is quite small. For example, in pharmacological 

studies, we may be attempting to predict the concentration of some drug at dif- 
ferent times across multiple patients. Finding a good regression function of an 
individual patient based only on his or her measurements can be difficult due to 
insufficient training points for the patient. Instead, by using measurements across 
all the patients, we may be able to leverage common patterns across patients 
to obtain better estimates for the population and for each patient individually. 
Multi-task learning captures this intuition aiming to learn multiple correlated 
tasks simultaneously. This idea has attracted much interest in the literature and 
several approaches have been applied to a wide range of domains including med- 
ical diagnosis [1], recommendation systems [2] and HIV Therapy Screening [3]. 



This is an extended version of our ECML 2010 paper. 



Building on the theoretical framework for single-task learning, multi-task learn- 
ing has recently been formulated by [4] as a multi-task regularization problem 
in vector-valued Reproducing Kernel Hilbert space. 

Several approaches formalizing multi-task learning exist within Baycsian 
statistics. Considering hierarchical Baycsian models [5,6], one can view the pa- 
rameter sharing of the prior among tasks as a form of multi-task learning where 
evidence from all tasks is used to infer the parameters. Over the past few years, 
Baycsian models for multi-task learning were formalized using Gaussian pro- 
cesses [7-9]. In this mixed-effect model, information is shared among tasks by 
having each task combine a common (fixed effect) portion and a task specific 
portion, each of which is generated by an independent Gaussian process. 

Our work builds on this formulation extending it and the associated algo- 
rithms in several ways. In particular, we extend the model to include three new 
aspects. First, we allow the fixed effect to be multi-modal so that each task may 
draw its fixed effect from a different cluster. Second, we extend the model so that 
each task may be an arbitrarily phase-shifted image of the original time series. 
This yields our GMT model: the shift-invariant grouped mixed-effect model. Al- 
ternatively, our model can be viewed as a probabilistic extension of the Phased 
K-means algorithm of [10] that performs clustering for phase-shifted time series 
data and as a non-parametric Bayesian extension of mixtures of random effects 
regressions for curve clustering [11]. Finally, unlike the existing models that re- 
quire the model order to be set a priori, our extension in the DP-GMT model 
uses a Dirichlet process prior on the mixture proportions so that the number 
of mixture components is adaptivcly determined by the data rather than being 
fixed explicitly. 

Our main technical contribution is the inference algorithm for the proposed 
model. We develop details for the em algorithm for the GMT model and a Vari- 
ational em for DP-GMT optimizing the maximum a posteriori (MAP) estimates 
for the parameters of the models. Technically, the main insights are in estimat- 
ing the expectation for the coupled hidden variables (the cluster identities and 
the task specific portion of the time series) and in solving the regularized least 
squares problem for a set of phase-shifted observations. In addition, for the DP- 
GMT, we show that the variational em algorithm can be implemented with the 
same complexity as the fixed order GMT without using sampling. Thus the DP- 
GMT provides an efficient model selection algorithm compared to alternatives 
such as BIC. As a special case our algorithm yields the (Infinite) Gaussian mix- 
ture model for phase shifted time series, which may be of independent interest, 
and which is a generalization of the algorithms of [10] and [11]. 

Our model primarily captures regression of time series but because it is a 
generative model it can be used for class discovery, clustering, and classifica- 
tion. We demonstrate the utility of the model using several experiments with 
both synthetic data and real-world time series data from astrophysics. The ex- 
periments show that our model can yield superior results when compared to 
the single-task learning and Gaussian mixture models, especially when each in- 
dividual task is sparsely and non-synchronously sampled. The DP-GMT model 



yields results that are competitive with model selection using BIC over the GMT 
model, at much reduced computational cost. 

The remainder of the paper is organized as follows. Section 2 provides an 
introduction to the multi-task learning problem and its Bayesian interpretation 
and develops the main assumptions of our model. Section 3 defines the new 
generative model, Section 4 develops the EM algorithm for it, and the infinite 
mixture extension is addressed in Section 5. The experimental results are re- 
ported in Section 6. Related work is discussed in Section 7 and the final section 
concludes with a discussion and outlines ideas for future work. 



2 Preliminaries 

Throughout the paper, scalars are denoted using italics, as in x, y £ H; vectors 
use bold typeface, as in x,y, and Xi denotes the ith entry of x. For a vector x 
and real valued function / : H — > IR, we extend the notation for / to vectors so 
that f(x) = [/(a?i), ■ • • , f(x n )] T where the superscript T stands for transposition 
(and the result is a column vector). /C(-, ■) denotes a kernel function associated 
to some reproducing kernel Hilbert space (RKHS) % and its norm is denoted as 
\-n. To keep the notation simple, ^ 7 - =1 is substituted by J^ ■ where the index 



j is not confusing. 



2.1 Multi-task learning with kernel 



Given training set T> = {sEj, j/j}, i = 1, • • • , N, where Xi = [xn, Xi2, • • • , £id] T £ 
X C IR , single-task learning focuses on finding a function / : X — > IR that best 
fits and generalizes the observed data. In the regularization framework, learning 
/ amounts to solving the following variational problem [12, 13] 

f* = argmin j £ V(f(x t ), Vi ) + \\\ff n 1 (1) 

where V(-,-) is some (typically convex) loss function. The norm || • \\-u relates to 
regularity condition on the function where a large norm penalizes non-smooth 
functions. The regularization parameter A provides a tradeoff between the loss 
term and the complexity of the function. 

Consider a set of M tasks, with jth task X> J = (xj,yf),i = 1,2, ••• ,rij. 
Multi-task learning seeks to find /•? for each task simultaneously, which, assuming 
square loss function, can be formulated as the following regularization problem 



{, M n-j 
T7 EE^ - /'N)) 2 + APENtf 1 , f 



J j )> (2) 



where the penalty term, applying jointly to all the tasks, encodes our prior 
information on how smooth the functions are, as well as how these tasks are 



correlated with each other. For example, setting the penalty term to ^2 ,- ll/^llw 
implies that there is no correlation among the tasks. It further decomposes the 
optimization functional to M separate single-task learning problems. On the 
other hand, with a shared penalty, the joint regularization can lead to improved 
performance. Moreover, we can use a norm in RKHS with a multi-task kernel to 
incorporate the penalty term [14]. Formally, consider a vector- valued function 
/ : X ->• H M defined as / = [/\/ 2 ,--- ,/ M ] T . Then Equation (2) can be 
written as 

{ M nj ~\ 

3 = 1 i=l J 

where || • \\u is the norm in RKHS with the multi-task kernel Q : (A, X) x 
(A, X) — > IR, where A = {1, 2, • • • , M }. As shown by [4], the representer theorem 
gives the form of the solution to Equation (3) 

M rij 

/'(•) = EE^'2((^).N.i)) (4) 

with norm 

rip rifc 

ii/ii 2 s = EEE c ^2((^^)'K fc ' fc ))- 

^,fc i=l j=\ 

LctC = [c{, C i,--- ,<J T ,Y= foj.j/i,.. • ,y^] T 6E E ^ andX = \x\,x\,- ,<J, 
then the coefficients {c^} are given by the following linear system 

(Q + AI)C = Y (5) 

where Q G IR^' "j x ^j n i is the kernel matrix formed by X. 

2.2 Bayesian formulation 

A Gaussian process is a functional extension for Multivariate Gaussian distribu- 
tions. In the Bayesian literature, it has been widely used in statistical models by 
substituting a parametric latent function with a stochastic process with a Gaus- 
sian prior [15]. More precisely, under the single-task setting a simple Gaussian 
regression model is given by 

V = f(x) + e 

where /'s prior is a zero mean Gaussian process with covariance function /C and 
e is independent zero mean white noise with variance a 2 . Given data set V = 
{xi,yi},i = l,--- ,N, let K = (/C(xi,Xj))*,j, then f = [/(an),--- J{x N )] T ~ 
A/ r (0, K) and the posterior on f is given by 

Pr(f \V) = M{K{a 2 I + K^y, a 2 {a 2 l + K) _1 K). 



The predictive distribution for some test point x* distinct from the training 
examples is 

Pr(/(0|x„2>)= /pr(/(x«)|x,,/)Pr(/|D)d/ 

= JV (k(x*) T {a 2 I + K)- 1 ?/, JC(x*, x*) - k(x*) T (a 2 I + K)- 1 ^*)) 

where k(x») = [K,(xi,x*), ■ ■ ■ , K(xn, a;»)] T . Furthermore, under square loss 
function, the optimizer of Equation (1) is equal to the expectation of the pre- 
dictive distribution [15]. Finally, a Gaussian process / corresponds to a RKHS 
% with kernel /C such that 

cov[f(x),f(y)]=lC(x,y) Vx,y € X. (6) 

In this way, we can express a prior on functions / using a zero mean Gaussian 
process [16] 3 

/~expj-i||/|&}. (7) 

Applying this framework in the context of multi-task learning, the model is given 
by 

yl=fi( X l) + eij 

where f 3 are zero mean Gaussian processes and eij captures i.i.d. zero-mean 
noise with variance a 2 . [9] formalize the connection between multi-task kernel Q 
and covariancc function among {f 3 } using 

C ov[f(x),f j (x')] = Q((x,i),(x',j)), i,j = l,---,M. (8) 

2.3 Basic Model Assumptions 

Given data {T> 3 }, the so-called nonparamctric Baycsian mixed-effect model [16, 
9] captures each task f 3 with respect to V 3 using a sum of an average effect 
function and an individual variation for each specific task, 

f(x) = f(x) + f(x), j = l,---,M. 

This assumes that the fixed-effect (mean function) / is sufficient to capture 
the behavior of the data, an assumption that is problematic for distributions 
with several modes. To address this, we introduce a mixture model allowing 
for multiple modes (just like standard Gaussian mixture model (GMM)), but 
maintaining the formulation using Gaussian processes. This amounts to adding 
a group effect structure and leads to the following assumption: 



3 In general, a Gaussian process can not be thought of as a distribution on the RKHS, 
because with probability 1, one can find a Gaussian process such that its sample 
path does not belong to the RKHS. However, the equivalence holds between the 
RKHS and the expectation of a Gaussian process conditioned on a finite number 
of observations. For more details on the relationship between RKHS and Gaussian 
processes we refer interested reader to [17]. 



Assumption 1 For each j and x G X , 

f(x) = f z .(x) + f(x), j = l,---,M (9) 

where {f s },s = !,■■■ ,k and f 3 are zero-mean Gaussian processes and Zj G 
{1, • ■ • , k}. In addition, {f s } and f 3 are assumed to be mutually independent. 

With the grouped-effect model and groups predefined, one can define a kernel 
that relates (with non zero similarity) only points from the same example or 
points for different examples but the same center as follows 

Q((x,i),(x',j)) = S ZiiZi K, z .(x,x') + 5ijJCi(x,x') 

where 

ilC Zt (x,x') = cov[f Zi (x), f Zi {x% 
\}C i (x,x')=cov[p(x),p(x')}. 

However, in our work the groups arc not known in advance and we cannot use 
this formulation. Instead we use a single kernel to relate all tasks. 

The second extension allows us to handle phase shifted time series. In some 
applications, we face the challenge of learning a periodic function f 3 : IR — > IR 
on a single period T from samples V = {x 3 ,y 3 },x 3 ,y 3 G H" j ,j = 1, • • ■ , M, 
where similar functions in a group differ only in their phase. In the following 
assumption, the model of primary focus in this paper is presented, which ex- 
tends the mixed-effect model to capture both shift-invariancc and the clustering 
property. 

Assumption 2 For each j and x G [0,T), 

f(x) = [f Z] * S tj ](x) + f j (x), ,j = 1, ■ ■ ■ ,M (10) 

where Zj G {1, • • • , k} , {/ s }, s = 1, • • • , k and f 3 are zero-mean Gaussian pro- 
cesses, * stands for circular convolution and St is the Dirac 5 function with 
support at tj G [0,T). 4 In addition, {/s},/- 7 are assumed to be mutually inde- 
pendent. 



4 Given a periodic function / with period T, its circular convolution with another 
function h is defined as 

ftn+T 



(/*/>)(*)= f° f(t-r)h(r)dr 

Jtn 



'to 

where to is arbitrary in IR and / * h is also a periodic function with period T. Using 
the definition we see that, 

/*M*) = /(*-**)> 

and thus * performs a right shift of / or in other words performs a phase shift of tj 
on /. 



3 Shift- invariant Grouped mixed-effect model 

In Assumption 1, if wc know the cluster assignment of each task, then the model 
decomposes to k mixed-effect models which is the case investigated in [9,4]. 
Similar results can be obtained for Assumption 2. However, prior knowledge 
of cluster membership is often not realistic. In this section, based on Assump- 
tion 2, a probabilistic generative model is formulated to capture the case of 
unknown clusters. We start by formally defining the generative model, which we 
call Shift- invariant Grouped mixed-effect Model (GMT). In this model, k group 
effect functions are assumed to share the same Gaussian prior characterized by 
/Co- The individual effect functions are Gaussian processes with covariance func- 
tion /C. The model is shown in Figure 1 and it is characterized by parameter set 
Ai = {/C07 /C, a , {tj}j °~ 2 } an d summarized as follows 

1. Draw/.|/C ~ejcp{-|||/ a ||| t0 } ) s = l,2,---,k 

2. For the jth time series 

— Draw Zj\a ~ Discrctc(a) 

-Draw/'|AC~exp{-§||/'|&} 

- Draw y* | Zj , P , x> ,t 3 ,a 2 ~Af (fj ( X J ) , a 2 l) , where f* = f Z] * S tj + fi 




Fig. 1. GMT: Plate graph 



where a is the mixture proportion. Additionally, denote X = {x l ,x 2 



,x M } 

and y = {y 1 , y 2 , • • • , y }, where x^ are the time points when each time series 
is sampled and y J are the corresponding observations. 

Wc assume that the group effect kernel /Co and the number of centers k are 
known. The assumption on /Co is reasonable, in that normally we can get more 
information on the shape of the mean waveforms, thereby making it possible 
to design kernel for T-Lq. On the other hand, the individual variations are more 



arbitrary and therefore /C is not assumed to be known. The assumption that 
k is known requires some form of model selection. An extension using a non- 
parametric Bayesian model, the Dirichlet process [18], that does not limit k is 
discussed in the section 5. The group effect {/ s }, individual shifts {tj}, noise 
variance a 2 and the kernel for individual variations K are unknown and need 
to be estimated. The cluster assignments {zA and individual variation {/•'} 
are treated as hidden variables. Note that one could treat {f s } too as hidden 
variables, but we prefer to get a concrete estimate for these variables because of 
their role as the mean waveforms in our model. 

The model above is a standard model for regression. We propose to use 
it for classification by learning a mixture model for each class and using the 
Maximum A Posteriori (MAP) probability for the class for classification. In 
particular, consider a training set that has L classes, where the jth instance 
is given by V? = (x^y^o 1 ) G H" 3 x 1R" J x {1,2, ••• ,L}. Each observation 
(a;- 7 , yi) is given a label from {1, 2, • • • , L}. The problem is to learn the model 
M^ for each class (L in total) separately and the classification rule for a new 
instance (x* , y*) is given by 

o= argmax Pr(y*|a;*; M e ) Pr(f). (11) 

1={1,-,L} 

As we show in our experiments, the generative model can provide explanatory 
power for the application while giving excellent classification performance. 

4 Parameter Estimation 

Given data set V = {x- 7 ,^/- 7 } = {xl,y^},i = 1, ■ • • ,rij,j = 1, ■ • • ,M, the 
learning process aims to find the MAP estimates of the parameter set M. = 
{a,{/J,{t,},a 2 ,/C} 

M* = argmax (Pr0>|#; M) X Pr[{/ S }; K ]) . (12) 

M 

The direct optimization of Equation (12) is analytically intractable because of 
coupled sums that come from the mixture distribution. To solve this problem, we 
resort to the em algorithm [19]. The em algorithm is an iterative method for opti- 
mizing the maximum likelihood (ML) or MAP estimates of the parameters in the 
context of hidden variables. In our case, the hidden variables are z = {zj} (which 
is the same as in standard GMM), and f = {f,- = P(&)},j = 1, • • • , M. The 
algorithm iterates between the following expectation and maximization steps 
until it converges to a local maximum. 

4.1 Expectation step 

In the E-stcp, we calculate 

Q(M, M 9 ) = E {m ,tw,M'} N {Pr(^ f, *|*; M) X Pr[{/ S }; £„]}] (13) 



where M 9 stands for estimated parameters from the last iteration. For our 
model, the difficulty comes from estimating the expectation with respect to the 
coupled latent variables {z,f}- In the following, we show how this can be done. 
First notice that, 

Pr(z, f | X, y; M 9 ) = JJ Pr(z„ f, -\X,y-, M 9 ) 



and further that 

Pv(z j ,f j \X,y;M 9 )=Pv(z j \x j ,y^,M 9 ) x Pv(f j \z j ,x j ,y j ;M 9 ). (14) 

The first term in Equation (14) can be further written as 
PT(zj\x j ,y j ;M 9 ) xPr(z :j ;M 9 )Pr(y i \z :i ,x j ;M 9 ) 

= Pr( Zj ;M 9 ) [pi(yl,f j \z j ,x?;M'')df j (15) 

= Py{z 3 ;M 9 ) J Pr(y»'|£j, Zj , **; M 9 ) PrfaM")^ 

where Pt(zj; A4 9 ) is specified by the parameters estimated from last iteration. 
Since Zj is given, the second term is the marginal distribution that can be 
calculated using a Gaussian process regression model. In particular, denoting 
f J = Szi *5t j {x J ) we get 



N 



p 




'K 9 + a 2 l 


K ; 





5 


[ K 9 


K« 



where K^ is the kernel matrix for the jth task using parameters from last iter- 
ation, i.e. K? = (/C(a;^, x\))u. Therefore, the marginal distribution is 



y j \zj ~7V(F,K? + a 2 I). 



(16) 



Next consider the second term in Equation (14). Given Zj, we know that f J = 
f z . + f J , i.e. there is no uncertainty about the identity of f z . and therefore 
the calculation amounts to estimating the posterior distribution under standard 
Gaussian process regression, that is 



P~c^{--\\P\\l 



yi -f> -A/"(/ j (a; J ),cr 2 I) 
1 
2 

and the conditional distribution is given by 



where [A is the posterior mean 



M^KKKf + ^I)- 1 ^-^) 



(17) 
(18) 



and C 9 is the posterior covariance of fj- 

q = K 9 - Kf (K? + a 2 !)- 1 ^ 9 . (19) 

Since Equation (15) is multinomial and f, is Normal in (17), the marginal dis- 
tribution of fj is a Gaussian mixture distribution given by 

Pi(f j\x j ,y j ;M 9 ) = ^ Pr 0j = s|a; J ',y J '; Al 9 ) 

To work out the concrete form of Q(.M, .M 9 ), denote z^; = 1 iff z% = I. Then the 
complete data likelihood can be reformulated as 

c = Pi(y,f,*-,x,M) 

= Hl[[a s J>v(y^{ j \z j = s;M)] Zjs 

j s 

= n n [°- pi (v j & > z i= s -> m ) pr & ; ^)] zjs 

where we have used the fact that exactly one Zj s is 1 for each j and included the 
last term inside the product over s for convenience. Then Equation (13) can be 
written as 

s 

Denote the second term by Q. By a version of Fubini's theorem [20] we have 
Q = TE{z\x,y;Mn}^{f\z,x,y;M3} [log£] 

= X> W ^M»){x;i>, (20) 

x fdPv&lzj = a) tog [aaPrd^'lf,,^ = S ;X)Pr(f j; X)] I. 

Now because the last term in Equation (20) does not include any z,, the equation 
can be further decomposed as 

j s \ z / 

X j / d Pr(f, |^ = s) \og[a s Pr(y^ |f, , Zj = s; X) Pr(f, ; X)] j 

= E E 7i. / d Pr & te = s ) lo § [«« Pr (^ > z ^ = s ; M ) Pr & ; *0] 

= EE^ s %l^^^^) [loga s +log(Pr(y J |f J ,z J = s; .M)) + log(Pr(f,-; Af))] 

(21) 



where 

-y- -Mz- W at-M*\ ~ Pr fe = ^^-M 9 ) (22) 

l3S - Hz 3S \y ,x,m\- EsFr{z . = s{xJiyj . Mg) W 

can be calculated from Equation (15) and (16) and jj 3 can be viewed as a frac- 
tional label indicating how likely the jth task is to belong to the sth group. Recall 
that Pr(r/ J '|fj, Zj — s) is a normal distribution given by Af([f z - *5t-](x : >)+fj, a 2 I) 
and Pr(fj; M) is a standard multivariate Gaussian distribution determined by 
its prior 

Pr(f 7 ;X) = - l =exp(-^ffK- 1 f ? l . 
Using these facts and Equation (21), Q(A4, M 9 ) can be re-formulated as 

Q{M,M 9 ) = ~\Y,WUn -E^ b ^ + EE^ lo S^ 

s 3 J s 

ijEE^tti^-''*'^} [ii^ _ [/- *^]( sB, ')-fiii 2 ] 



2ct 2 

- 5^ log |Kj| - - 5^ ^ 7i«E { fi l«, 



i,yi;M<>} \ T j ^j r iJ 

(23) 

We next develop explicit closed forms for the remaining expectations. For 
the first, note that for x ~ A/"(/x, S) and a constant vector a, 

E[||a-*|| 2 ]=E[||a|| 2 -2<a,*> + H| a ] 

= ||a|| 2 - 2(a,E[a?]> + E[a;] 2 + Tr(S) 
= ||a- At || 2 + Tr(S). 

Therefore the expectation is 

JE {f ^=*,*W;A< S } [llv 1 ' - [Is * St^x*) i 3 \\ 2 ] = J-^Tr(q) 

, (24) 

+ ^EE^ (ii^ - [/-*^i(« i )-Mi.n a ) 

j * 

where (tx JS = Er f .i z ._ s -a! j . y i :j v(9} [fj] is as i n Equation (18) where we set Zj = s 
explicitly. 

For the second expectation we have 

^{t j \z i =s,mi,yi;Ms} ( f j K j f i) =^{t i \z i =a,a>i,yi;MB} [ Tr ( f j K j f j)] 



- 1®{t j \z j =8,m*,yi;M'>} [ Tr ( K j f J f j )] 

= Tr (%| 2j . =8|a .j, y i.^j}[K: fjf, ]) 
= Tr(KTi ( q+/x? s (M? s ) T )). 



4.2 M-step 

In this step, we aim to find 

M* = argmax Q(M,M 9 ) (25) 

and use Ai* to update the model parameters. Using the results above this can 
be decomposed into three separate optimization problems as follows: 

M* = argmax Qi (({/«}, {<^-},<r)) 



M 



h(JC) + {J2Y,^ h z ( 



3 s 

That is, a can be estimated easily using its separate term, Qi is only a function 
of ({/ s }, {tj}, <t) and Q2 depends only on /C, and we have 

gi({/ s },fe}, ( T 2 ) = i^||/ s ||^ +J2n j loga+ ^L^Tr(q) 

3 (26) 

i » 
and 

Q*w = -^E^i^-i -5EE^(^( C ' + W.) T )) ■ (27) 

The optimizations for Qi and Q2 are described separately in the following two 
subsections. 

Learning {/«}, {*?}) c 2 To optimize Equation (26) we assume first that a is 
given. In this case, optimizing {/ s }, {tj} decouples into k sub-problems, finding 
sth group effect f„ and its corresponding shift {tj}. Denoting the residual y J = 
y J — jjLjs, where \ij S = TE[fj\y J , Zj = s], the problem becomes 

rij 

argmin I -^ "£ 7js J^(yj - [f *6 tj ]{x\)f + -||/|& }. (28) 

Note that different x J , y J have different dimensions rij and they are not assumed 
to be sampled at regular intervals. For further development, following [9], it 
is useful to introduce the distinct vector x G IR whose component are the 
distinct elements of X. For example if x 1 = [l,2,3] T ,a; 2 = [2,3,4, 5] T , then 
x = [1, 2, 3, 4, 5] T . For jth task, let the binary matrix C k be such that 

x j = C j -x, fix 1 ) = C j -f{x). 



That is, & extracts the values corresponding to the jth task from the full 
vector. If {tj} are fixed, then the optimization in Equation (28) is standard and 
the representer theorem gives the form of the solution as 

IN 

/(•) = X>M^,-). (29) 

Denoting the kernel matrix as & = Ko(xi,Xj), i,j — 1, • • • , IN , c = [ci, • • • , cin] t 
and we get f(x) — Ac. To simplify the optimization we assume that {tj} 
can only take values in the discrete space {ti,--- ,£l}, that is, tj = U, for 
some i G 1,2, ••• , Z (e.g., a fixed finite fine grid), where we always choose 
ii = 0. Therefore, we can write [/ * S tj ] {x) = ICjc, where K, tj is K,o(x, {{x — tj) 
mod T]). Accordingly, Equation (28) is reduced to 

argmin J V yj ,\\ff - & ■ Kj. c|| 2 + ±c T ilc I . (30) 

cem IN ,*i,-,tie{fi} I -, 2 

To solve this optimization, we follow a cyclic optimization approach where we 
alternate between steps of optimizing / and {tj} respectively, 

— At step £, optimize equation (30) with respect to {tj} given c^\ Since c^) 
is known, it follows immediately that Equation (30) decomposes into M 
independent tasks, where for the jth task we need to find t- such that 
C J '/C^)C is closest to y J under the Euclidean distance. A brute force search 

3 

with time complexity 0(JNL) yields the optimal solution. If the time series 
are synchronously sampled (i.e. C 3 ' = I, j = 1, • • • , M), this is equivalent to 
finding the shift r corresponding the cross- correlation, defined as 

C(u, v) = max(u, v +T ) (31) 

T 

where u = M.c and v = y- 7 and v +r refers to the vector v right shifted by r 
positions, and where positions are shifted modulo IN. Furthermore, as shown 
by [21], if every a;- 7 has regular time intervals, we can use the convolution 
theorem to find the same value in (IN log IN) time, that is 



tj = argmax ( &~ 



<% -V 



(r) (32) 



where J?" J [•] denotes inverse Fourier transform, • indicates point- wise mul- 
tiplication; a i/ is the Fourier transform of u and "f is the complex conjugate 
of the Fourier transform of v. 
— At step £+1, optimize equation (30) with respect to c^ e+1 ' given t± , • ■ ■ , t M . 

least square problem can be reformulated as 



For the jth task, since t, is known, denote 0K , {l) as 2Jc . The regularized 



argmin \ ^ lja \\ff - Ttfcf + -c T £c I . (33) 



Taking derivatives of Equation (33), we see that the new c^ +1 ^ value is 
obtained by solving the following linear system 

-2 ]T lj8 ■ (Ttff (ff Wlf ■ c) + Jte = 0. (34) 

3 

Obviously, each step decreases the value of the objective function and therefore 
the algorithm will converge. 

Given the estimates of {/ s }, {tj}, the optimization for a 2 is given by 



a* = argmirJ ^ n 3 logcr + — -j ^ Tr(C?) 

pEE^« ("^ - if>^]( X j )-n JS \\ 2 ) \ 

j s ) 



where {/*} and {£*} are obtained from the previous optimization steps. Let 
R = Ej Tr(Cf) + Zj Es lis {\W ~ [ft * h\{x j ) - » 3 s\\ 2 ). Then it is easy to 
see that (a*) 2 =R/Y JJ n r 

Learning the kernel for individual effect [16] have already shown how to 
optimize the kernel function in a similar context. Here we provide some of the 
details for completeness. If the kernel function /C admits a parametric form with 
parameter 8, for example the RBF kernel 

lC(x,y)=acxpl —^ — \ (36) 

where 9 = {a, s}, then the optimization of the kernel K, amounts to finding 6* 
such that 

^=argmax|-i^log|(K J ;0)|-i^^ 7js Tr((K,;0)- 1 (q + ^( At fJ T ))|. 

(37) 
It is easy to sec the gradient of the right hand side of Equation (37) is 



2 ^ V J 89 J 2 ^ ^ u " V 3 86 

(38) 

Therefore, any optimization method, e.g. conjugated gradients can be utilized 
to find the optimal parameters. Notice that given the inverse of kernel matrix 
{Kj}, the computation of the derivative requires 0(J2 n 2 ) steps. The parametric 
form of the kernel is a prerequisite to perform the regression task when examples 
are not sampled synchronously as in our development above. 



If the data is synchronously sampled, for classification tasks we only need 
to find the kernel matrix K for the given sample points and the optimization 
problem can be rewritten as 

K* = argmaxj - i J>g |K| - \ ^^V* ( K_1 ( C ? + "?.(/£) T )) }• 

(39) 

Similar to maximum likelihood estimation for multivariate Gaussian distribu- 
tion, the solution is 

K * = 17 EE^( c f + K>;«) T )- ( 4 °) 

3 s 

In our experiments, we use both approaches where for the parametric form 
we use the RBF kernel as outlined above. 

4.3 Algorithm Summary 

The various steps in our algorithm and their time complexity are summarized 
in Algorithm 1. 

Once the model parameters M are learned (or if they are given in advance), 
we can use the model to perform regression or classification tasks. The following 
summarizes the procedures used in our experiments. 

— Regression: To predict a new sample point for an existing task (task j) we 
calculate its most likely cluster assignment Zj and then predict the y value 
based on this cluster. Concretely, Zj is determined by 

Zj = argmax [Pr(zj = s\x J , y J ; M)] (41) 

S={1,~ ,fc} 

and given a new data point x, the prediction y is given by 

y=[f Zj *S tj ](x) + P(x). 

— Classification: For classification, we get a new time series and want to pre- 
dict its label. Recall from Section 3, Equation (11) that we learn a separate 
model for each class and predict using 

o= argmax Pr(y|x;M^) Pr(^). 
e={i,--,L} 

In this context, Pr(£) is estimated by the frequencies of each class and the 
likelihood portion is given by first finding the best time shift t for the new 
time series and then calculating the likelihood according to 

Pr(y|:r; M t ) = £ Pr(z|X £ ) Pr(y|z, x; M t ) (42) 

z 

where Mi is the learned parameter set and the second term is calculated 
via Equation (16). 



Algorithm 1 EM algorithm for Shift-invariant GMT 



Initialize {/j 0) }, {tf >}, a<°> and /C<°\ 



repeat 

Calculate K' according to x J ,/C'* -1 '. The time complexity for constructing 
kernel are <9(X] n j) an d <9(1) in parametric and nonparametric case respec- 
tively. 
4: Calculate yj S according to Equation (22). For each task, we need to invert 

the covariance matrix in the marginal distribution and then calculate the 
likelihood, thus the time complexity is OQ^n^). 
5: for all s such that < s < k do 

6: Update a (t) such that ai t} = ]T . Jj s /M. 

7: repeat 

8: Update {tj} w.r.t. cluster s such that tj 6 {ti, ■ ■ ■ , ti,} and minimize 

||jp — C J ■ K-J.ci '\\ 2 . The time complexity is <9(LIN) as discussed 
above. 
9: Update c s by solving linear system Equation (34), which requires 

<9(1N 3 ). 
10: until converges or reach the iteration limit 

11: end for 

12: Update a^ t+1 ' according to Equation (35). 

13: Update the parameters of the kernel or the kernel matrix directly via opti- 

mizing Equation (37) or using the closed-form solution Equation (40) for K. 
In the former case, a gradient based optimizer can be used with time com- 
plexity 0(^2 n 2 ) for each iteration; while in the later case, the estimation only 
requires ©(fcMIN). 
14: until converges or reach the iteration limit 



5 Infinite mixture of Gaussian processes 

In this section we develop an extension of the model removing the assumption 
that the number of centers k is known in advance. 



5.1 Dirichlet process basics 

We start by reviewing basic concepts for Dirichlet processes. Suppose we have 
i.i.d. data such that 

where J- is an unknown distribution that needs to be inferred from {xt}. A 
Parametric Bayesian approach assumes T is given by a parametric family J-g 
and the parameters 9 follow a certain distribution that comes from our prior 
belief. However, this assumption has limitations both in the scope and the type 
of inferences that can be performed. Instead, nonparametric Bayesian approach 
places a prior distribution on the distribution T directly. The Dirichlet process 
(DP) is used for such purpose. The DP is parameterized by a base distribution 
Go and a positive scaling parameter (or concentration parameter) a. A random 



measure G is distributed according to a DP with base measure Go and scaling 
parameter a if for all finite measurable partitions {-Bj}, i = 1, • • • , k, 

(G(B 1 ), G(B 2 ), ■■■ , G(B k )) ~ Dir(oG (5i), aG (B 2 ), ■■■ , aG (B k )) 

where Dir(-) is the Dirichlet distribution. It is known that G is almost surely a 
discrete measure. 

The Dirichlet process mixture model extends this setting, where the DP is 
used as a nonparametric prior in a hierarchical Bayesian specification. More 
precisely, 

G\{a,G }~VP(a,G ) 

Vn \G~G n = l,2,... 

X n \rjn ~ f{Xn\Vn) 

where / is some probability density function that is parameterized by r\. Data 
generated from this model can be naturally partitioned according to the distinct 
values of the parameter r\ n . Hence, the DP mixture can be interpreted as a 
mixture model where the number of mixtures is flexible and grows as the new 
data is observed. Alternatively, we can view the infinite mixture model as the 
limit of the finite mixture model. Consider the Bayesian finite mixture model 
with a symmetric Dirichlet distribution as the prior of the mixture proportions. 
When the number of mixtures k — 5- 00, the Dirichlet distribution becomes a 
Dirichlet process [?, see]]neal2000markov. 

[22] provides a more explicit construction of the DP which is called the stick- 
breaking construction (SBC). Given {a, Go}, we have two collections of random 
variables V t ~ Beta(I, a) and 77* ~ G , i = {1, 2, • • • , }. The SBC of G is 

t-i 



71* (v) =ViY[(l-Vj) 



J'=l 



G=y"Vj(v)i% 



If we set vk = 1 for some if, then we get a truncated approximation to the DP 

A 



G = y^n{v)5r,i 



4=1 

[23] shows that when selecting the truncation level K appropriately, the trun- 
cated DP behaves very similarly to the original DP. 

5.2 The DP-GMT Model and Inference Algorithm 

In this section, we extend our model by modeling the mixture proportions using 
a DP prior. The plate graph is shown in Figure 2. Under the SBC, the generative 
process is as follows 



1. Draw v s \a ~ Beta(l, a), s = {1, 2, . . .} 

2. Draw / S |/C ~ cxp {-§||/.|& }, a = {1, 2, . . .} 

3. For the jth time series 



s-l, 



(a) Draw Zj\{vi,v 2 , • • •} ~ Discrete(7r(v)), where 7r s (v) = w s IL=i (1 _ l '*); 



(b) Draw/*|/C~exp{-§||/'|&}; 



(c) Draw & \ z 3 , p , x* , tj , a 2 ~ A/" (f (x>), a 2 l) , where fi = £ . * «J t . + /'" . 




Fig. 2. DPGMT: Plate graph 



In this model, the concentration parameter a is assumed to be known. As in 
Equation (12), the inference task is to find the MAP estimates of the parameter 
set A4 = {{f s },{tj},a 2 ,IC}. Notice that in contrast with the previous model, 
the mixture proportion are not estimated here. To perform the inference, we 
must consider another set of hidden variables v = {vi} in addition to f and z. 
However, calculating the posterior of the hidden variables is intractable, thus 
the variational em algorithm [?, e.g.,]]bishop2006pattern is used to perform the 
approximate inference. The algorithm can be summarized as follows: 

— Variational E-Step Choose a family Q of variational distributions q(i, v, z) 
and find the distribution q* that minimizes the Kullback-Leibler (KL) di- 
vergence between the posterior distribution and the proposed distribution 
given the current estimate of parameters, i.e. 

q*{{ , v, z; M 9 ) = argmin KL (Pr(f , v,x\X, 3>; M 9 )\ \q(i , v, z)) (43) 

q&Q 



where 

KL(Pr(f,v,z|A-,y;^ s )||g(f,v,z)) = 



Pr(f,v,z| < Y,^;7W i 



g(f,v,z) 
— Variational M-Step Optimize the parameter set M. such that 



dg(f,v,z). 



M* = argmax Q(M, M 9 ) 
M 



where 

Q(M,M 9 ) =E g . KfjV .^ 9) [log{Pr(^,f,z,v|A';A4) X Pr[{/ S };/C ]}] . 

(44) 

Variational E-Step. For the variational distribution q() we use the mean 
field approximation [24]. That is we assume a factorized distribution for disjoint 
group of random variables. This results in an analytic tractable optimization 
problem. In addition, following [25], we approximate the distribution over v 
using a truncated stick-breaking representations, where for a fix T, q{vx = 1) = 1 
and therefore 7Ts(v) =0, s > T. In this paper, we fix the truncation level T 
while in general it can also be treated as a variational parameter. Concretely, 
we propose the following factorized family of variational distributions over the 
hidden variables {f,v,z}: 

T-l M 

g(f,v,z) = Y[ Qs(v s )Y[q 3 (f 3 ,z :i ). (45) 

8=1 3=1 

Note that we do not assume any parametric form for {q s ,qj} and our only as- 
sumption is that the distribution factorizes into independent components. To op- 
timize Equation (43), recall the following result from [?, ] Chapter 8]bishop2006pattern: 

Lemma 1. Suppose we are given a probabilistic model with a joint distribution 
Pr(X, Z) over X, Z where X = {xi,X2,--- ,Xjv} denote the observed variables 
and all the parameters and hidden variables are Z = {zi,Z2, • • • ,Zm}- Assume 
the distribution of Z has the following form: 

M 

q(Z) =IJ »(«*)■ 
i 

Then, the KL divergence between the posterior distribution Pr(Z|X) and q(Z) 
is minimized and the optimal solution q*{zj) is given by 

q*(z J )oc e ^(]E^[logPr(X,Z)]) 

where ]E^j[- • •] denotes the expectation w.r.t. q() over all Zj, j ^ i. 

From the graphical model in Figure 2, the joint distribution of Pr(^, f, v,z\X; j\4 9 ) 
can be written as: 

Pr{y, f , v,z\X) = Pr(y\X, f , z) Pr(z|v) Pr(f \X) Pr(v|o) 

= n pr(?/ 1**, fj, zj) n Pro*,- iv) n pr&v) n *<■»* w- 

3 3 3 s 

Equivalently, 

logPr(};,f,v,z|AO =^logPr(yV,f^) 



y^logPrfe-|v) + V*logPr(f,|a^) +^logPr(w s |a). 



First we consider the distribution of g s (v). Following [25], the second term can 
be expanded as 



logPrfo|v) = Y^ 1 {z ] >t} log(l - ih) + t {Z]=t} log 



> ! i 



(46) 



where 1 is the indicator function. Therefore, using the lemma above and denoting 
v\v s by v_ s , we have 

\ogq s {v s ) ocE Zjf> _ s [logPr(y,f,v,z|Af)] 

= ^ f^.f.v-, [ 1 {2 J >s}] log(l - v s ) + ]E z ,f^ v _ s [l{ 2j -= s }] loguJ + logPr(w s |a) + constant 

3 

= 2_, (l( z j > s ) l°g(l — v s) + l( z j = s ) logu s ) + logPr(v s |a) + constant 



Recalling that the prior is given by Beta(l, a) we see that the distribution of 
q s (v s ) is 

Observing the form of q s {v s ), we can see that it is a Beta distribution and 
qt(v t ) ~ Beta(7t,i,7t,2) where 



"/ 



i = l + £)9(*J=<) 



Tt,2 =« + X] H 9(«j =0- 

3 l=s+l 

We next consider qj(fj,Zj). Notice that we can always write q 3 (fj,Zj) = 
qj(fj\zj)qj(zj). Denote h(zj) = IE V [logPr(zj|v)], then again using the lemma 
above we have 

q {i \ Z] )q 3 {z ) oc e^Pr^^f^JPr&V) 
= e h ^Pr(y\f ] \x\z 3 ) 
oc e h{Zi) PnW',^-) Pr(fj-|x J ', y\ Zj ) 



e h ^'Pr(y j \x j ,z j ) 



Qj( z j) 



YT{i 3 \x\y\z 3 ) 
v * ' 

<?j( f jl z j) 



The equality in the second line holds because Pr(fj|x J ) = Pr(fj|a; J , z,-); their dis- 
tributions become coupled when conditioned on the observations y J , but without 
such observations they are independent. Therefore the left term yields 



(fcfoOKe^PrCi/V'.Zj) 



where Pr(j/ J '|ar J ', Zj) is given by Equation (16). The value of h(zj) can be calcu- 
lated using Equation (46): 

s-l 

logPr(zj = s|v) = ^ log(l - w t ) + logi> s 

s-l 

fc(Zj = s) = E„ 5 [\ogv s ] + J2 JE Vi [log(l - Vi )] 



! = 1 



where 



E„ t [logu t ] = ^(7i,i) - <p-(7i,i +7i, 2 ) 
E„ s [log(l - Vi)} = ^(7i, 2 ) - ^(7i,i + 7i,2)- 



(47) 



Consequently, Qj(zj) has the following form 



t-i 



#(*j = t) oc exp i JE Vt [log ut] + ^ E„, [(1 - Wi )] [ x AA (V , Kf + a 2 l) . (48) 



Note that this is the same form as in Equation (15) of the previous model where 
Pv(zj]M 9 ) is replaced by e h ( Zi=t \ 

Given Zj, qj(fj\zj) is identical to Equation (17) and leads to the conditional 
distribution such that 

which is the posterior distribution under GP regression and thus is exactly the 
same form as in the previous model. 

Variational M-Step. Denote Q as the expectation of the complete data log 
likelihood w.r.t. the hidden variables. Then as in Equation (20), we have 



Q = E, (v) E, w E, (fW log ] I [Tr.MPrG^'Ifj,^ - a; /*)&(&', M)]'» 



3 « 



= E V 



{ j s j 

x log [tt.(v) Pr(y»'|£j,^ = a;M) Pr(f,; M)} 

^2i( z )\ 5ZS^'» ■ / d Q( f o\zj = *) 

1 i » ^ 

x log [Pr(^ |f, , ^ = a; M) Pr(f, ; M)] 1 + E v 



^^log7r s (v) 



(49) 



Notice that E v ^\- ^ s log7r s (v) is a constant w.r.t. the parameters of A4 and 
can be dropped in the optimization. Thus, following the same derivation as in 
the GMT model, we have the form of the Q function as 

Q(M,M g ) = -\j2 ll/'l&o - E n i lo 8' CT 

s 3 

3 s 

+ EE^%i)}NPr(fi;M)]. 

(50) 

where 7j S is given by Equation (22). Now because the Qj{zj) and qj(fj\zj) 
have exactly the same form as before (except Pt(zj;A4 9 ) is replaced by Equa- 
tion (48)), the previous derivation of the M-Step w.r.t. the parameter set M. 
still holds. 

To summarize, the algorithm is the same as Algorithm 1 except that 

— we drop step 6, 

— we add a step between steps 3 and 4 calculating 7^1 and 7^2 using Equa- 
tion (47), 

— step 4 calculating Equation (22) uses Equation (48) instead of Equation (15). 



6 Experiments 

Our implementation of the algorithm makes use of the gpml package [26] and 
extends it to implement the required functions. The em algorithm is restarted 5 
times and the function that best fits the data is chosen. The em algorithm stops 
when difference of the log-likelihood is less than 10e-5 or at a maximum of 200 
iterations. 



6.1 Regression on Synthetic data 

In the first experiment, we demonstrate the performance of our algorithm on a 
regression task with artificial data. We generated the data following Assumption 
1 under a mixture of three Gaussian processes. More precisely, each f s (x),s = 
1, 2, 3 is generated on the interval [—50, 50] from a Gaussian process with covari- 
ance function 

cov[f s (t 1 )J s (t 2 )]=e- it1 ^ 1 , 5 = 1,2,3. 

The individual effect fj is sampled via a Gaussian process with the covariance 
function 

cov[/ J -(ti),/ J -(i 2 )]=0.2e- lll # i . 



Then the hidden label Zj is sampled from a discrete distribution with the param- 
eter a = [0.5,0.5]. The vector x consists of 100 samples on [— 50, 50] 5 . We fix a 
sample size N, each x J includes N randomly chosen points from {xi, • • • , £100} 
and the observation fi(x 3 ') is obtained as (f Zj + fj)(x 3 ). In the experiment, 
we vary the individual sample length N from 5 to 50. Finally, we generated 50 
random tasks with the observation y J for task j given by 

y i ~N(fi(x j ),0.01 xl), j = !,-■■ ,50. 

The methods compared here include 

1. Single-task learning procedure (ST), where each / J is estimated only 
using {x J i ,yj},i = 1,2,- ■• ,N. 

2. Single center mixed-effect multi-task learning (SCMT), amounts to 
the mixed-effect model [9] where one average function / is learned from 
{xJ,y*},j = 1, • • • , 50 and p = f + P,j = 1, • ■ • , 50. 

3. Grouped mixed-effect model (GMT), the proposed method with num- 
ber of clusters fixed to be the true model order. 

4. Dirichlet process Grouped mixed-effect model (DP-GMT), the in- 
finite mixture extension of the proposed model. 

5. "Cheating" grouped fixed-effect model (CGMT), which follows the 
same algorithm as the grouped mixed-effect model but uses the true label Zj 
instead of their expectation for each task j. This serves as an upper bound 
for the performance of the proposed algorithm. 

All algorithms (except for ST which does not estimate the kernel of the individ- 
ual variations) use the same method to learn the kernel of the individual effects, 
which is assumed to have the form 

( f l-*2) 2 

cov[fj{ti),fj{t 2 )] = ae ?■ . 

The Root Mean Square Error (RMSE) for the four approaches is reported. For 
task j, the RMSE is defined as 



1 

100 ' 



RMSE j = A / — \\f(x)- fi(x) 



where / is the learned function and RMSE for the data set is the mean of 
{RMSEj}, j = 1, ••• ,50. To illustrate the results qualitatively, we first plot in 
Figure 3 the true and learned functions in one trial. The left/center/right column 
illustrates some task that is sampled from group effect /1, fa and fs. It it easy 
to see that, as expected, the tasks are poorly estimated under ST due the sparse 
sampling. The SCMT performs better than ST but its estimate is poor in areas 
where the three centers disagree. The estimates of GMT are much closer to the 
true function. 

Figure 4 shows a comparison of the algorithms for 50 random data sets under 
the above setting when N equals 5. We see that GMT with the correct model 



The samples are generated via Matlab command: linspace(-50,50,100). 




Fig. 3. Simulated data: Comparison of the estimated function between single, multi- 
task and grouped multi-task. The red dotted line is the reference true function. 



order k — 3 almost always performs as well as its upper bound, illustrating 
that it recovers the correct membership of each task. On only three data sets, 
our algorithm is trapped in a local maximum yielding performance similar to 
SCMT and ST. Figure 5 shows the RMSE for increasing values of N for the 
same experimental setup. From the plot we can draw the conclusion that the 
proposed method works much better than SCMT and ST when the number of 
samples is less than 30. As the number of samples for each task increases, all 
methods are improving, but the proposed method always outperforms SCMT 
and ST in our experiments. Finally, all algorithms converge to almost the same 
performance level where observations in each task are sufficient to recover the 
underlying function. Finally, Figure 5 also includes the performance of the DP- 
GMT on the same data. The truncation level of the Dirichlct process is 10 and 
the concentration parameter a is set to be 1. As we can see the DP-GMT is 
not distinguishable from the GMT (which has the correct k) , indicating that the 
model selection is successful in this example. 



6.2 Classification on Astrophysics data 



The concrete application motivating this research is the classification of stars 
into several meaningful categories from the astronomy literature. Classification 
is an important step within astrophysics research, as evidenced by published 
catalogs such as OGLE [27] and MACHO [28, 29]. However, the number of stars 
in such surveys is increasing dramatically. For example Pan-STARRS [30] and 
LSST [31] collect data on the order of hundreds of billions of stars. Therefore, it is 
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Fig. 4. Simulated data: Comparison between single, multi-task, and grouped multi-task 
when sample size is 5. The figure gives 3 pairwise comparison. The Blue stars denote 
ST vs. GMT: we can see the GMT is better than ST since the stars are concentrated on 
the lower right. Similarly, the plot of red pluses demonstrates the advantage of GMT 
over SCMT and the plot of green triangles shows that the algorithm behaves almost 
as well as its upper bound. 
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Fig. 5. Simulated data: Performance comparison of single, multi-task, and grouped 
multi-task, and DP grouped multi-task as a function of the number of samples per 
task. 




Fig. 6. Examples of light curves of periodic variable stars. Each column shows two 
stars of the same type. Left: Cepheid, middle: RR Lyrae, right: eclipsing binary. 



desirable to apply state-of-art machine learning techniques to enable automatic 
processing for astrophysics data classification. 

The data from star surveys is normally represented by time series of bright- 
ness measurements, based on which they arc classified into categories. Stars 
whose behavior is periodic are especially of interest in such studies. Figure 6 
shows several examples of such time series generated from the three major types 
of periodic variable stars: Cepheid, RR Lyrae, and Eclipsing Binary. In our ex- 
periments only stars of these classes are present in the data, and the period of 
each star is given. 

From Figure 6, it can be noticed that there are two main characteristics of 
this data set: 



— The time series are not phase aligned, meaning that the light curves in the 
same category share a similar shape but with some unknown shift. 

— The time series are non-synchronously sampled and each light curve has 
different number of samples and sampling times. 

We run our experiment on the OGLEII data set [32]. This data set consists of 
14087 time series from periodic variable stars with 3425 Cepheids, 3390 EBs and 
7272 RRLs. We use the time series measurements in the I band [32]. We perform 
several experiments with this data set to explore the potential of the proposed 
method. In previous work with this dataset [33] developed a kernel for periodic 
time series and used it with SVM to obtain good classification performance. We 
use the results of [33] as our baseline. 6 



[33] used additional features, in addition to time series itself, to improve the classifi- 
cation performance. Here we focus on results using the time series only. Extensions 
to add such features to our model are orthogonal to the theme of the paper and we 
therefore leave them to future work in the context of the application. 





UP + GMM 


GMT 


UP + 1-NN 


K + SVM 


Results 


0.956 ± 0.006 0.952 ± 0.005 0.865 ± 0.006 0.947 ± 0.005 



Table 1. Accuracies with standard deviations reported on OGLEII dataset. 



Classification using dense-sampled time series In the first experiment, 
the time series are smoothed using a simple average filter, re-sampled to 50 
points via linear-interpolation and normalized to have mean and standard 
deviation of 1. Therefore, the time series are synchronously sampled in the pre- 
processing. We compare our method to Gaussian mixture model (GMM) and 
1-Nearest Neighbor (1-NN). These two approaches are performed on the time 
series processed by Universal phasing (UP), which uses the method from [21] 
to phase each time series according to the sliding window on the time series 
with the maximum mean. We use a sliding window size of 5% of the number of 
original points; the phasing takes place after the pre-processing explained above. 
We learn a separate model for each class and for each class the model order for 
GMM and GMT is set to be 15. 

We run 10-fold cross-validation (CV) over the entire data set and the results 
are shown in Table 1. We see that when the data is densely and synchronously 
sampled, the proposed method performs similar to the GMM, and they both 
outperform the kernel based results of [33]. The similarity of the GMM and 
the proposed method under these experimental conditions is not surprising. The 
reason is that when the time series are synchronously sampled, aside from the 
difference of phasing, finding the group effect functions is reduced to estimating 
the mean vectors of the GMM. In addition, learning the kernel in the non- 
parametric approach is equivalent to estimating the covariance matrix of the 
GMM. More precisely, assuming all time series are phased (that is, tj = for all 
j), the following results hold: 

1. By placing a flat prior on the group effect function / s , s = 1, • • ■ , k, or 
equivalently setting ||/ s ||^ = 0, Equation (28) is reduced to finding a vector 
/.j s 6 IN that minimizes ^2jljs\\y^ — Hs\\ 2 - Therefore, we obtain f s = /i s = 
Y]. jjgijj/ ^ . 7 JS , which is exactly the mean of the sth cluster during the iter- 
ation of em algorithm under the GMM setting. 

2. The kernel K is learned in a non-parametric way. For the GP regression 
model, we see that considering noisy observations is essentially equivalent to 
considering non-noisy observations, but slightly modifying the model by adding 
a diagonal term on the covariance function for f} . Therefore, instead of estimating 
K and <r 2 , it is convenient to put these two terms together, forming K = K + 
<7 2 I. In other words, we add a a 2 term to the variance of f, and remove it 
from y J which becomes deterministic. In this case, comparing to the derivation 
in Equation (16) — (19) we have f) = yi — P and fj is determined given Zj. 
Comparing to Equation (17) we have the posterior mean /x? ~ KK _1 (y J - /j, s ) = 
yJ — /i s and the posterior covariance matrix C® vanishes. Applying these values 
in Equation (40) we get K = jj V . J^ s lisiy^ — ^s)(y j — Ms) T - In the standard 



EM algorithm for the GMM, this is equal to the estimated covariance matrix 
when all k clusters are assumed to have the same variance. 

Accordingly, when time series are synchronously sampled, the proposed model 
can be viewed as an extension of the Phased K- means [10]. The Phased K-means 
(PKmcans) re-phases the time series before the similarity calculation and up- 
dates the centroids using the phased time series. Therefore, with shared covari- 
ance matrix, our model is a shift-invariant (Phased) GMM and the corresponding 
learning process is a Phased EM algorithm where each time series is re-phased 
in the E step. In experiments presented below we use Phased GMM directly in 
the feature space and generalize it so that each class has a separate covariance 
matrix. 

We use the same experimental data to investigate the performance of the DP- 
GMT where the truncation level is set to be 30 and the concentration parameter 
a of the DP is set to be 1 . The results are shown in Figure 7 and Table 2 where 
BIC-GMT means that the model order is chosen by BIC where the optimal k 
is chosen from 1 to 30. The poor performance of SCMT shows that a single 
center is not sufficient for this data. As evident from the graph the DP-GMT is 
not distinguishable from the BIC-GMT. The advantage of the DP model is that 
this equivalent performance is achieved with much reduced computational cost 
because the BIC procedure must learn many models and choose among them 
whereas the DP learns a single model. 
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Fig. 7. OGLEII data: Comparison of model selection methods using dense sampled 
data. The plot shows the performance of GMT with varying k, BIC for the GMT 
model, and DP-GMT. For visual clarity we only include the standard deviations on 
the GMT plot. 





SCMT 


GMT 


DP-GMT 


BIC-GMT 


Results 


0.874 ± 0.008 0.952 ± 0.005 0.949 ± 0.005 0.950 ± 0.002 



Table 2. Accuracies with standard deviations reported on OGLEII dataset. 



Classification using sparse-sampled time series The OGLEII data set is 
in some sense a "nice" subset of the data from its corresponding star survey. 
Stars with small number of samples are often removed in pre-processing steps. 
For example, [34] developed full system to process the MACHO catalog and 
applied the kernel method to classify stars. In its pipeline, part of the prepro- 
cessing rejected 3.6 million light curves of the approximate 25 million because of 
insufficient number of observations. The proposed method potentially provides 
a way to include these instances in the classification process. In the second ex- 
periment, we demonstrate the performance of the proposed method on times 
series with sparse samples. Similar to the synthetic data, we started from sub- 
sampled versions of the original time series to simulate the condition that we 
would encounter in further star surveys. 7 As in the previous experiment, each 
time series is universally phased, normalized and linearly-interpolated to length 
50 to be plugged into GMM and 1-NN as well as the phased GMM mentioned 
above. The RBF kernel is used for the proposed method and we use model order 
15 as above. Moreover, the performance for PKmeans is also presented, where 
the classification step is as follows: we learn the PKmeans model with k = 15 
for each class and then the label of a new example is assigned to be the same 
as its closest centroid's label. PKmeans is also restarted 5 times and the best 
clustering is used for classification. 

The results are shown in Figure 8. As can be easily observed, when each 
time series has sparse samples (i.e., number of samples per task is less than 30), 
the proposed method has a significant advantage over the other methods. As 
the number of samples per task increases, the proposed method improves fast 
and performs close to its optimal performance given by previous experiment. 
Three additional aspects that call for discussion can be seen in the figure. First, 
note that for all three methods, the performance with dense data is lower than 
results in Table 1. This can be explained by fact that the data set obtained 
by the interpolation of the sub-sampled measurements contains less information 
than that interpolated from the original measurements. Second, notice that the 
Phased EM algorithm always outperforms the GMM plus UP demonstrating 
that re-phasing the time series inside the em algorithm improves the results. 
Third, when the number of samples increases, the performance of the Phased em 
gradually catches up and becomes better than the proposed method when each 
task has more than 50 samples. GMM plus universal phasing (UP) also achieves 
better performance when time series are densely sampled. One reason for the 



7 For the proposed method, we clip the samples to a fine grid of 200 equally spaced 
time points on [0, 1], which is also the set of allowed time shifts. This avoids having 
a very high dimensional x, e.g. over 18000 for OGLEII, which is not feasible for any 
kernel based regression method that relies on solving linear systems. 
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Fig. 8. OGLEII data: Comparison of algorithms with sparsely sampled data 
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Fig. 9. OGLEII data: Comparison of algorithms with sparsely sampled data 



performance difference is the difference in the way the kernel is estimated. In 
Figure 8 GMT uses the parametric form of the kernel which is less expressive than 
getting precise estimates for every K,(tx,t2). The GMM uses the non-parametric 
form which, given sufficient data, can lead to better estimates. A second reason 
can be attributed to the sharing of the covariance function in our model where 
the GMM and the Phased GMM do not apply this constraint. 

Finally, we use the same experimental setting to compare the performance 
of various mode selection models. The results are shown in Figure 9. The per- 
formance of BIC is not distinguishable from the optimal k selected in hindsight. 
The performance of DP is slightly lower but it comes close to these models. 

To summarize, wc conclude from the experiments with astronomy data that 
Phased EM is appropriate with densely sampled data but that the GMT and its 
variants should be used when data is sparsely and non-synchronously sampled. 
In addition BIC coupled with GMT performs excellent model selection and DP 
does almost as well with a much reduced computational complexity. 

Class discovery: Wc show the potential of our model for class discovery 
by running two version of the GMT model on the joint data set of the three 
classes (not using the labels). Then, each cluster is labeled according to the 
majority class of the instances that belong to the center. For a new test point, 
we determine which cluster it belongs to via the MAP probability and its label is 
given by the cluster that it is assigned to. We run 10 trials with different random 
initializations. In accordance with previous experiments that used 15 components 
per class wc run GMT with model order of 45. We also run DP-GMT with a 
truncation level set to 90. The GMT obtains accuracy and standard deviation 
of [0.895,0.010] and the DP models obtains accuracy and standard deviation of 
[0.925,0.013]. Note that it is hard to compare between the results because of 
the different model orders used. Rather than focus on the difference, the striking 
point is that we obtain almost pure clusters without using any label information. 
Given the size of the data set and the relatively small number of clusters this is 
a significant indication of the potential for class discovery in astrophysics. 

7 Related Work 

Classification of time series has attracted an increasing amount of interest in 
recent years due to its wide range of potential applications, for example ECG 
diagnosis [35], EEG diagnosis [16], and Speech Recognition [36]. Common meth- 
ods choose some feature based representation or distance function for the time 
series (for example the sampled time points, or Fourier or wavelet coefficients 
as features and dynamic time warping for distance function) and then apply 
some existing classification method [37,38]. Our approach falls into another 
category, that is, model-based classification where the time series are assumed 
to be generated by a probabilistic model and examples are classified using max- 
imum likelihood or MAP estimates. A family of such models, closely related to 
the GMT, is discussed in detail below. Another common approach uses Hidden 



Markov models as a probabilistic model for sequence classification, and this has 
been applied to time series as well [39]. 

Learning Gaussian processes from multiple tasks has previously been investi- 
gated in the the hierarchical Bayesian framework, where a group of related tasks 
are assumed to share the same prior. Under this assumption, training points 
across all tasks are utilized to learn a better covariance function via the EM 
algorithm [7,8]. In addition, [16] extended the work of [8] to a non-parametric 
mixed-effect model where each task can have its own random effect. Our model 
is based on the same algorithmic approach where the values of the function for 
each task at its corresponding points (i.e. {fj} in our model) are considered as 
hidden variables. Furthermore, the proposed model is a natural generalization 
of [8] where the fixed-effect function is sampled from a mixture of regression 
functions each of which is a realization of a common Gaussian process. Along 
a different dimension, our model differs from the infinite mixtures of Gaussian 
processes model for clustering [40] in two aspects: first, instead of using zero 
mean Gaussian process, we allow the mean functions to be sampled from an- 
other Gaussian process; second, the individual variation in our model serves as 
the covariance function in their model but all mixture components share the 
same kernel. 

Although having a similar name, the Gaussian process mixture of experts 
model focuses mainly on the issues of non-stationarity in regression [41,42]. By 
dividing the input space into several (even infinite) regions via a gating network, 
the Gaussian process mixture of expert model allows different Gaussian processes 
to make predictions for different regions. 

In terms of the clustering aspect, our work is most closely related to the so- 
called mixture of regressions [43, 11, 44, 45]. The name comes from the fact that 
these approaches substitute component density models with conditional regres- 
sion density models in the framework of standard mixture model. For phased 
time series, [45] first proposed the regression-based mixture model where they 
used Polynomial and Kernel regression models for the mean curves. Further, 
[11] integrated the linear random effects models with mixtures of regression 
functions. In their model, each time series is sampled by a parametric regression 
model whose parameters are generated from a Gaussian distribution. To incor- 
porate the time shifts, [46] proposed a shift-invariant Gaussian mixture model 
for multidimensional time series. They constrained the covariance matrices to 
be diagonal to handle the non-synchronous case. They also treated time shifts 
as hidden variables and derived the em algorithm under full Bayesian settings, 
i.e. where each parameter has a prior distribution. Furthermore, [43] developed 
a generative model for misaligned curves in a more general setting. Their joint 
clustering-alignment model also assumes a normal parametric regression model 
for the cluster labels, and Gaussian priors on the hidden transformation vari- 
ables which consist of shifting and scaling in both the time and magnitude. Our 
model extends the work of [11] to admit non-parametric Bayesian regression 
mixture models and at the same time handle the non-phased time series. If the 
group effects are assumed to have a flat prior, our model differs from [46] in the 



following two aspects in addition to the difference of Bayesian treatment. First, 
our model does not include the time shifts as hidden variables but instead esti- 
mates them as parameters. Second, we can handle shared full covariance matrix 
instead of diagonal ones by using a parametric form of the kernel. On the other 
hand, given the time grid as, we can design the kernel for individual variations 
as K,(xi,Xj) = a,idij(xi,Xj),i,j = 1, ••■ , IN. Using this choice, our model is the 
same as [46] with shared diagonal covariance matrix. In summary, our model 
allows a more flexible structure of the covariance matrix that can treat synchro- 
nized and non-synchronized time series in a unified framework, but at the same 
time it is constrained to have the same covariance matrix across all clusters. 



8 Conclusion 



We developed a novel Bayesian nonparamctric multi-task learning model (GMT) 
where each task is modeled as a sum of a group-specific function and an indi- 
vidual task function with a Gaussian process prior. We also extended the model 
such that the number of groups is not bounded using a Dirichlet process mix- 
ture model (DP-GMT). We derive efficient EM and variational EM algorithms 
to learn the parameters of the models and demonstrated their effectiveness us- 
ing experiments in regression, classification and class discovery. Our models are 
particularly useful for sparsely and non-synchronously sampled time series data, 
and model selection can be effectively performed with these models. 

There are several natural directions for future work. For application in the 
astronomy context it is important to consider all steps of processing and classifi- 
cation of a new sky survey so as to provide an end to end system. Therefore, two 
important issues to be addressed in future work include incorporating the pe- 
riod estimation phase into the method and developing an appropriate method for 
abstention in the classification step. It would also be interesting to develop a cor- 
responding discriminative model extending [5] to the GP context. Finally, one of 
the drawbacks of the GP based methods is the computational complexity which 
is too high for large scale problems. For example, in the experiments on sparse 
OGLEII data, we had to resample the data on a fine grid to avoid performing 
Cholcsky decomposition for high dimensional matrices. Therefore, an important 
direction for future work is to find non-trivial sparse GP approximations that 
yield good performance with the GMT model. 
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